Abstract

We consider the adaptive shortest-path routing problem in wireless networks under unknown and 
stochastically varying link states. In this problem, we aim to optimize the quality of communication 
between a source and a destination through adaptive path selection. Due to the randomness and uncertainties
in the network dynamics, the quality of each link varies over time according to a stochastic
process with unknown distributions. After a path is selected for communication, the aggregated quality 
of all links on this path (e.g., total path delay) is observed. The quality of each individual link is not 
observable. We formulate this problem as a multi-armed bandit with dependent arms. We show that by 
exploiting arm dependencies, a regret polynomial with network size can be achieved while maintaining 
the optimal logarithmic order with time. This is in sharp contrast with the exponential regret order with 
network size offered by a direct application of the classic MAB policies that ignore arm dependencies. 
Furthermore, our results are obtained under a general model of link-quality distributions (including 
heavy-tailed distributions) and find applications in cognitive radio and ad hoc networks with unknown 
and dynamic communication environments. 

I. Introduction 

We consider a wireless network where the quality state of each link is unknown and stochastically 
varying over time. Specifically, the state of each link is modeled as a random cost (or reward) evolving 
i.i.d. over time under an unknown distribution. At each time, a path from the source to the destination 
is selected as the communication route and the total end-to-end cost, given by the sum of the costs of 
all links on the path, is subsequently observed. The cost of each individual link is not observable. The 
objective is to design the optimal sequential path selection policy to minimize the long-term total cost. 
A. Stochastic Online Learning based on Multi-Armed Bandit

The above problem can be modeled as a generalized Multi-Armed Bandit (MAB) with dependent arms. 
In the classic MAB Hl-lH, there are N independent arms and a player needs to decide which arm to 
play at each time. An arm, when played, offers a random cost drawn from an unknown distribution. 
The performance of a sequential arm selection policy is measured by regret, defined as the difference 
in expected total cost with respect to the optimal policy in the ideal scenario of known cost models 
where the player always plays the best arm. The optimal regret was shown to be logarithmic with time 
in [2]. Furthermore, since cost observations from one arm do not provide information about the cost
of other arms, the optimal regret grows linearly with the number of arms. We can model the adaptive 
routing problem as an MAB by treating each path as an arm. The difference is that paths are dependent 
through shared links. While the dependency across paths can be ignored in learning and the policies for 
the classic MAB directly apply, such a naive approach yields poor performance with a regret growing 
linearly with the number of paths, thus exponentially with the network size (i.e., the number of links in 
the worst case). 
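
To make the gap between the number of links and the number of paths concrete, the following sketch counts both for a hypothetical layered topology (the topology, the function name `layered_network_sizes`, and its parameters are assumptions chosen for illustration, not a network studied in this paper). The source feeds a first layer of `width` nodes, consecutive layers are fully connected, and the last layer feeds the destination.

```python
def layered_network_sizes(width: int, depth: int) -> tuple[int, int]:
    """Return (number_of_links, number_of_paths) for the assumed layered network.

    Links: `width` from the source, `width` into the destination, and
    width * width between each of the (depth - 1) pairs of adjacent layers.
    Paths: one node is picked per layer, so there are width ** depth paths.
    """
    links = 2 * width + (depth - 1) * width * width
    paths = width ** depth
    return links, paths


if __name__ == "__main__":
    for depth in (2, 4, 8):
        links, paths = layered_network_sizes(width=3, depth=depth)
        print(f"depth={depth}: links={links}, paths={paths}")
```

In this toy family the link count grows linearly in the depth while the path count grows geometrically, which is the regime where a per-path regret term becomes prohibitive.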

In this paper, we show that by exploiting the structure of the path dependencies, a regret polynomial 
with network size can be achieved while preserving its optimal logarithmic order with time. Specifically, 
we propose an algorithm that achieves an $O(d^3 \log T)$ regret for all light-tailed cost distributions, where
$d \le m$ is the dimension of the path set, $m$ the number of links, and $T$ the length of time horizon. We
further show that a modification to the proposed algorithm leads to a regret linear with d by sacrificing an 
arbitrarily small regret order with time. This result allows a performance tradeoff in terms of the network 
size and the time horizon. For example, the algorithm with a smaller regret order in d can perform better 
over a finite time horizon when the network size is large. For heavy-tailed cost distributions, a regret 
linear with the number of edges and sublinear with time can be achieved. We point out that any regret 
sublinear with time implies the convergence of the time-average cost to the minimum one of the best 
path. 

We further generalize the adaptive routing problem to stochastic online linear optimization problems, 
where the action space is a compact subset of $\mathbb{R}^d$ and the random cost function is linear in the actions.
We show that for all light-tailed cost functions, regret polynomial or linear with d and sublinear with 
time can be achieved. We point out that for cases where there exists a nonzero gap in the expected cost 
between the optimal action and the rest of the action space (e.g., when the action set is a polytope or 
finite), the same regret orders obtained for the adaptive routing problem can be achieved. 
B. Applications

One application example is adaptive routing in cognitive radio networks where secondary users communicate
by exploiting channels temporarily unoccupied by primary users. In this case, the availability
of each link dynamically varies according to the communication activities of nearby primary users. The 
delay on each link can thus be modeled as a stochastic process unknown to the secondary users. The 
objective is to route through the path with the smallest latency (i.e., the lightest primary traffic) through 
stochastic online learning. 

Other applications include ad hoc networks where link states vary stochastically due to channel fading 
or random contentions with other users. 

C. Related Work 

The classic MAB was addressed by Lai and Robbins in 1985, where they showed that the minimum 
regret has a logarithmic order with time and proposed specific policies to asymptotically achieve the 
optimal regret for several cost distributions in the light-tailed family [2]. Since Lai and Robbins's seminal
work, simpler index policies were proposed for different classes of light-tailed cost distributions to achieve 
the logarithmic regret order with time O-Q. In [6], we proposed the DSEE approach that achieves the
logarithmic regret order with time for all light-tailed cost distributions and sublinear regrets with time for 
heavy-tailed cost distributions. All regrets established in O-Q grow linearly with the number of arms. 
This work builds upon the general DSEE structure proposed in [6]. By incorporating the exploitation of
the arm dependencies into DSEE, we show that the learning efficiency can be significantly improved in
terms of network size while being preserved in terms of time.

This work generalizes the previous work Q-O that assumes fully observable link costs on a chosen 
path. In particular, the adaptive routing problem under this more informative observation model was 
considered in 10 and an algorithm was proposed to achieve $O(m^4 \log T)$ regret, where $m$ is the number
of links in the network. The problems considered in this paper were also studied under an adversarial 
bandit model in which the cost functions are chosen by an adversary and are treated as arbitrary bounded 
deterministic quantities [10]. Algorithms were proposed to achieve regrets sublinear with time and
polynomial with network size. The problem formulation and results established in this paper can be
considered as a stochastic version of those in [10].

For the special class of stochastic online linear optimization problems where there exists a nonzero 
gap in expected cost between the optimal action and the rest of the action space and the cost has a finite 
support, an algorithm was proposed in [11] to achieve an $O(d^2 \log^3 T)$ regret given that a nontrivial
lower bound on the gap is known. The algorithm proposed in this paper achieves a better regret in terms
of both $d$ and $T$ without any knowledge on the cost model. For the general case with finite-support
cost distributions, the regret was shown to be lower bounded by $\Omega(d\sqrt{T})$ and an efficient algorithm was
proposed to achieve an $O((d \ln T)^{3/2} \sqrt{T})$ regret [11]. Compared to the algorithm in [11], our algorithm
for the general stochastic online linear optimization problems performs worse in $T$ but better in $d$.

II. Problem Statement 

In this section, we address the adaptive routing problem under unknown and stochastically time-varying 
link states. Consider a network with a source s and a destination r. Let G = (V, E) denote the directed 
graph consisting of all simple paths from s to r. Let m and n denote, respectively, the number of edges 
and vertices in graph G. 

At each time $t$, a random weight/cost $W_e(t)$ drawn from an unknown distribution is assigned to each
edge $e$ in $E$. We assume that $\{W_e(t)\}$ are i.i.d. over time for each edge $e$. At the beginning of each time
slot $t$, a path $p_n \in \mathcal{P}$ ($n \in \{1, 2, \ldots, |\mathcal{P}|\}$) from $s$ to $r$ is chosen, where $\mathcal{P}$ is the set of all paths from $s$
to $r$ in $G$. Subsequently, the total end-to-end cost $C_t(p_n)$ of the path $p_n$, given by the sum of the weights
of all edges on the path, is revealed at the end of the slot. We point out that the individual cost on each
edge $e \in E$ is unobservable.

The regret of a path selection policy $\pi$ is defined as the expected extra cost incurred over time $T$
compared to the optimal policy that always selects the best path (i.e., the path with the minimum expected
total end-to-end cost). The objective is to design a path selection policy $\pi$ to minimize the growth rate
of the regret. Let $\sigma$ be a permutation on all paths such that

$$E[C_t(\sigma(1))] \le E[C_t(\sigma(2))] \le \cdots \le E[C_t(\sigma(|\mathcal{P}|))].$$

We define regret

$$\mathcal{R}_\pi(T) \triangleq E\Big[\sum_{t=1}^{T} \big(C_t(\pi) - C_t(\sigma(1))\big)\Big],$$

where $C_t(\pi)$ denotes the total end-to-end cost of the selected path under policy $\pi$ at time $t$.
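
As a small illustration of this definition, the sketch below evaluates the regret of a fixed sequence of path choices when the mean path costs are known. It assumes each choice depends only on past observations, so that the expected cost in a slot equals the mean cost of the chosen path; the function name `expected_regret` and the numbers are hypothetical.

```python
def expected_regret(mean_costs, chosen_paths):
    """Sum over slots of (mean cost of the chosen path - mean cost of the best path).

    mean_costs[n] is the expected end-to-end cost of path n, and
    chosen_paths[t] is the index of the path played in slot t.
    """
    best = min(mean_costs)
    return sum(mean_costs[n] - best for n in chosen_paths)


if __name__ == "__main__":
    MEAN_COSTS = [3.0, 5.0, 4.0]       # path 0 is the best path
    CHOICES = [1, 2, 0, 0]             # two suboptimal slots, then the best path
    print(expected_regret(MEAN_COSTS, CHOICES))
```

Only slots in which a suboptimal path is played contribute, which is why the later analysis focuses on counting such slots.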

III. Adaptive Shortest Path Routing Algorithm 

In this section, we present the proposed adaptive shortest-path routing algorithm. We consider both 
light-tailed and heavy-tailed cost distributions. 
A. Light-Tailed Cost Distributions

We consider the light-tailed cost distributions as defined below. 

Definition 1: A random variable $X$ is light-tailed if its moment-generating function exists, i.e., there
exists a $u_0 > 0$ such that

$$M(u) \triangleq E[\exp(uX)] < \infty \quad \forall\, |u| \le u_0;$$

otherwise $X$ is heavy-tailed.

For a zero-mean light-tailed random variable $X$, we have [12]

$$M(u) \le \exp(\zeta u^2/2), \quad \forall\, |u| \le u_0,\ \ \zeta \ge \sup\{M^{(2)}(u),\ -u_0 \le u \le u_0\}, \tag{1}$$

where $M^{(2)}(\cdot)$ denotes the second derivative of $M(\cdot)$ and $u_0$ the parameter specified in Definition 1.
From (1), we have the following extended Chernoff-Hoeffding bound on the deviation of the sample
mean from the true mean for light-tailed random variables O. 

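Let $\{X(t)\}_{t=1}^{\infty}$ be i.i.d. light-tailed random variables. Let $\overline{X}_s = (\sum_{t=1}^{s} X(t))/s$ and $\theta = E[X(1)]$. We
have, for all $\delta \in [0, \zeta u_0]$, $a \in (0, 1/(2\zeta)]$,

$$\Pr(|\overline{X}_s - \theta| \ge \delta) \le 2\exp(-a\delta^2 s). \tag{2}$$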
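
For readers who want to see where a bound of the form (2) comes from, the following is a sketch of the standard Chernoff argument. It assumes that (1) is applied to the zero-mean variable $X(t) - \theta$, writes $\zeta$ and $u_0$ for the constants in (1), $\delta$ for the deviation, and $\overline{X}_s$ for the sample mean of $s$ observations. It is an illustration and not a reproduction of the cited proof.

```latex
\begin{align*}
\Pr(\overline{X}_s - \theta \ge \delta)
  &\le e^{-s u \delta}\, E\Big[e^{u \sum_{t=1}^{s} (X(t) - \theta)}\Big]
     && \text{(Markov's inequality, } 0 \le u \le u_0\text{)} \\
  &\le \exp\big(-s\,(u\delta - \zeta u^2/2)\big)
     && \text{(independence over } t \text{ and (1))} \\
  &= \exp\big(-\delta^2 s/(2\zeta)\big)
     && \text{(choosing } u = \delta/\zeta\text{)}.
\end{align*}
```

The choice $u = \delta/\zeta$ is admissible exactly when $\delta \le \zeta u_0$, which matches the range of $\delta$ in the statement. Applying the same steps to $\theta - \overline{X}_s$ and adding the two tails gives the factor 2, and any $a \le 1/(2\zeta)$ only weakens the exponent.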

Note that the path cost is the sum of the costs of all edges on the path. Since the number of edges in 
a path is upper bounded by $m$, the bound on the moment generating function in (1) holds on the path
cost by replacing $\zeta$ by $m\zeta$, and so does the Chernoff-Hoeffding bound in (2).

In the following, we propose an algorithm to achieve a regret polynomial with m and logarithmic with 
$T$. We first represent each path $p_n$ as a vector $\mathbf{p}_n$ with $m$ entries consisting of 0s and 1s representing
whether or not an edge is on the path. The vector space of all paths is embedded in a $d$-dimensional ($d \le m$)
subspace¹ of $\mathbb{R}^m$. The cost on each path $p_n$ at time $t$ is thus given by the linear function

$$[W_1(t), W_2(t), \ldots, W_m(t)] \cdot \mathbf{p}_n.$$
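
The vector representation can be tried out on a toy graph. The sketch below is illustrative only: the five-edge acyclic graph `EDGES`, the helper names, and the edge weights are assumptions. It enumerates the simple paths from `s` to `r`, builds their 0/1 vectors, computes the dimension of their span by exact Gaussian elimination, and evaluates a path cost as an inner product.

```python
from fractions import Fraction

# Hypothetical acyclic graph with n = 4 vertices and m = 5 edges.
EDGES = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "r"), ("b", "r")]


def simple_paths(edges, src, dst):
    """Enumerate simple src->dst paths as lists of edge indices (depth-first search)."""
    found = []

    def dfs(node, visited, used):
        if node == dst:
            found.append(list(used))
            return
        for idx, (u, v) in enumerate(edges):
            if u == node and v not in visited:
                dfs(v, visited | {v}, used + [idx])

    dfs(src, {src}, [])
    return found


def incidence_vector(path_edges, m):
    """0/1 vector with a 1 in every position whose edge lies on the path."""
    return [1 if e in path_edges else 0 for e in range(m)]


def rank(rows):
    """Rank of a matrix (list of rows) via exact Gaussian elimination."""
    mat = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(mat[0])):
        pivot = next((i for i in range(r, len(mat)) if mat[i][col] != 0), None)
        if pivot is None:
            continue
        mat[r], mat[pivot] = mat[pivot], mat[r]
        for i in range(r + 1, len(mat)):
            factor = mat[i][col] / mat[r][col]
            mat[i] = [x - factor * y for x, y in zip(mat[i], mat[r])]
        r += 1
        if r == len(mat):
            break
    return r


def path_cost(weights, vec):
    """End-to-end cost: inner product of edge weights and the path vector."""
    return sum(w * x for w, x in zip(weights, vec))


PATHS = simple_paths(EDGES, "s", "r")
VECTORS = [incidence_vector(p, len(EDGES)) for p in PATHS]

if __name__ == "__main__":
    print("path vectors:", VECTORS)
    print("dimension d:", rank(VECTORS))
    print("cost of second path:", path_cost([1, 2, 3, 4, 5], VECTORS[1]))
```

For this particular graph the computed dimension equals $m - n + 2 = 3$, in line with the footnote on acyclic graphs.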

The general structure of the algorithm follows the DSEE framework established in [6] for the classic
MAB. More specifically, we partition time into an exploration sequence and an exploitation sequence.
In the exploration sequence, we sample the $d$ basis vectors (a barycentric spanner [10], as defined in the
following) evenly to estimate the qualities of these vectors. In the exploitation sequence, we select the
action estimated as the best by linearly interpolating the estimated qualities of the basis vectors. A detailed
algorithm is given in Fig. 1.



¹ If graph $G$ is acyclic, then $d = m - n + 2$.
Definition 2: A set $B = \{x_1, \ldots, x_d\}$ is a barycentric spanner for a $d$-dimensional set $A$ if every
$x \in A$ can be expressed as a linear combination of elements of $B$ using coefficients in $[-1, 1]$.

Note that a compact subset of $\mathbb{R}^d$ always has a barycentric spanner, which can be constructed efficiently
(see an algorithm in [10]). For the problem at hand, the path set $\mathcal{P}$ is a compact subset of $\mathbb{R}^d$ with
dimension $d \le m$. We can thus construct a barycentric spanner consisting of $d$ paths in $\mathcal{P}$ by using the
algorithm in [10].
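
For intuition, a barycentric spanner of a small finite set can be found by a determinant-swapping search. The sketch below is a brute-force illustration and not the efficient algorithm from the cited reference: it assumes the points are already given in $d$ coordinates and span the whole space, it checks every candidate swap exhaustively, and all names (`det`, `barycentric_spanner`, `coefficients`, `POINTS`) are hypothetical.

```python
from fractions import Fraction


def det(rows):
    """Exact determinant of a square matrix (list of rows) by Gaussian elimination."""
    mat = [[Fraction(x) for x in row] for row in rows]
    n, sign, result = len(mat), 1, Fraction(1)
    for col in range(n):
        pivot = next((i for i in range(col, n) if mat[i][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            mat[col], mat[pivot] = mat[pivot], mat[col]
            sign = -sign
        result *= mat[col][col]
        for i in range(col + 1, n):
            factor = mat[i][col] / mat[col][col]
            mat[i] = [x - factor * y for x, y in zip(mat[i], mat[col])]
    return sign * result


def barycentric_spanner(points):
    """Return d points whose determinant cannot be enlarged by any single swap."""
    d = len(points[0])
    # Start from the identity rows and replace them one by one with points
    # from the set, each time maximising the absolute determinant.
    basis = [[1 if i == j else 0 for j in range(d)] for i in range(d)]
    for i in range(d):
        basis[i] = list(max(
            points,
            key=lambda x: abs(det(basis[:i] + [list(x)] + basis[i + 1:])),
        ))
    # Keep swapping while some point strictly increases |det|.
    improved = True
    while improved:
        improved = False
        for i in range(d):
            for x in points:
                candidate = basis[:i] + [list(x)] + basis[i + 1:]
                if abs(det(candidate)) > abs(det(basis)):
                    basis, improved = candidate, True
    return basis


def coefficients(basis, x):
    """Cramer's rule: coefficients c with x = sum_i c[i] * basis[i]."""
    base = det(basis)
    return [det(basis[:i] + [list(x)] + basis[i + 1:]) / base
            for i in range(len(basis))]


POINTS = [(1, 0), (0, 1), (1, 1), (2, 1)]
SPANNER = barycentric_spanner(POINTS)

if __name__ == "__main__":
    print("spanner:", SPANNER)
    for x in POINTS:
        print(x, "->", coefficients(SPANNER, x))
```

When no swap can increase the absolute determinant, Cramer's rule shows that every coefficient is a ratio of determinants with absolute value at most one, which is the defining property in Definition 2.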

Adaptive Shortest-Path Routing Algorithm

• Notations and Inputs: Construct a barycentric spanner $B = \{p_1, p_2, \ldots, p_d\}$ of the vector
space of all paths. Let $A(t)$ denote the set of time indices that belong to the exploration
sequence up to (and including) time $t$ and $A(1) = \{1\}$. Let $|A(t)|$ denote the cardinality
of $A(t)$. Let $\bar{\theta}_{p_i}(t)$ denote the sample mean of path $p_i$ ($i \in \{1, \ldots, d\}$) computed
from the past cost observations on the path. For two positive integers $k$ and $l$, define
$k \oslash l = ((k - 1) \bmod l) + 1$, which is an integer taking values from $1, 2, \ldots, l$.

• At time $t$,

1. if $t \in A(t)$, choose path $p_n$ with $n = |A(t)| \oslash d$;

2. if $t \notin A(t)$, estimate the sample mean of each path $p_n \in \mathcal{P}$ by linearly interpolating
$\{\bar{\theta}_{p_1}(t), \bar{\theta}_{p_2}(t), \ldots, \bar{\theta}_{p_d}(t)\}$. Specifically, let $\{a_i : |a_i| \le 1\}_{i=1}^{d}$ be the coefficients such
that $\mathbf{p}_n = \sum_{i=1}^{d} a_i \mathbf{p}_i$, then $\bar{\theta}_{p_n}(t) = \sum_{i=1}^{d} a_i \bar{\theta}_{p_i}(t)$. Choose path

$$p^* = \arg\min\{\bar{\theta}_{p_n}(t),\ n = 1, 2, \ldots, |\mathcal{P}|\}.$$

Fig. 1. The general structure of the algorithm.
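
The structure in Fig. 1 can be exercised on a toy instance. The sketch below is a simulation under several assumptions that are not part of the paper: a hypothetical network with two stages of two parallel links (so $m = 4$ and $d = 3$), Gaussian link costs, a hand-picked spanner with its interpolation coefficients, sample means computed from exploration slots only, and an exploration threshold of the form $d\lceil d^2 w \log t\rceil$ with an uncalibrated $w = 1$.

```python
import math
import random

# Path vectors over the links (e0, e1 | e2, e3): one link is picked per stage.
PATHS = [(1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1)]
SPANNER = [0, 1, 2]                        # indices of the basis paths
# Coefficients a_i with p_n = sum_i a_i * p_(SPANNER[i]); all lie in [-1, 1].
COEFFS = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (-1, 1, 1)]


def run_routing(edge_means, horizon, w=1.0, noise=0.1, seed=0):
    """Simulate the explore/exploit schedule; return (picks per path, #explorations)."""
    rng = random.Random(seed)
    d = len(SPANNER)
    sums, counts = [0.0] * d, [0] * d
    explored = 0
    picks = [0] * len(PATHS)
    for t in range(1, horizon + 1):
        threshold = d * math.ceil(d * d * w * math.log(t))
        if t == 1 or explored < threshold:
            # Exploration slot: play the basis paths in round-robin order.
            explored += 1
            i = (explored - 1) % d         # 0-based version of |A(t)| (/) d
            n = SPANNER[i]
        else:
            # Exploitation slot: interpolate every path's mean from the basis.
            means = [s / c for s, c in zip(sums, counts)]
            est = [sum(a * mu for a, mu in zip(coef, means)) for coef in COEFFS]
            n = min(range(len(PATHS)), key=est.__getitem__)
            i = None
        # Only the end-to-end cost of the chosen path is observed.
        cost = sum(x * rng.gauss(mu, noise) for x, mu in zip(PATHS[n], edge_means))
        picks[n] += 1
        if i is not None:
            sums[i] += cost
            counts[i] += 1
    return picks, explored


if __name__ == "__main__":
    # Mean path costs are 3.5, 2.5, 2.5 and 1.5, so the non-basis path 3 is best.
    picks, explored = run_routing([2.0, 1.0, 1.5, 0.5], horizon=2000)
    print("picks per path:", picks, "exploration slots:", explored)
```

The best path here is deliberately not a member of the spanner, so it can only be identified through the interpolation step, never by sampling it directly during exploration.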



Theorem 1: Construct an exploration sequence as follows. Let $a, \zeta, u_0$ be the constants such that (2)
holds on each edge cost. Choose a constant $b > 2m/a$, a constant

$$c \in \Big(0,\ \min_{j:\, E[C_t(\sigma(j)) - C_t(\sigma(1))] > 0} E[C_t(\sigma(j)) - C_t(\sigma(1))]\Big),$$

and a constant $w > \max\{b/(md\zeta u_0)^2,\ 4b/c^2\}$. For each $t > 1$, if $|A(t-1)| < d\lceil d^2 w \log t\rceil$, then include
$t$ in $A(t)$. Under this exploration sequence, the resulting policy $\pi^*$ has regret

$$\mathcal{R}_{\pi^*}(T) \le A m d^3 \log T$$

for all $T > 1$ and some constant $A$ independent of $d$, $m$ and $T$.
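
The exploration rule in Theorem 1 is easy to generate explicitly. The sketch below lists the exploration slots for given $d$ and $w$, with values chosen arbitrarily for illustration rather than derived from any cost model; the function name `exploration_times` is an assumption.

```python
import math


def exploration_times(horizon, d, w):
    """Return the slots placed in the exploration sequence up to `horizon`.

    Slot 1 is always explored; slot t > 1 is explored when fewer than
    d * ceil(d^2 * w * log t) slots have been explored so far.
    """
    times = [1]
    for t in range(2, horizon + 1):
        if len(times) < d * math.ceil(d * d * w * math.log(t)):
            times.append(t)
    return times


TIMES = exploration_times(horizon=2000, d=3, w=1.0)
BOUND = 3 * math.ceil(9 * 1.0 * math.log(2000))

if __name__ == "__main__":
    print("exploration slots:", len(TIMES), "upper bound:", BOUND)
```

The number of exploration slots never exceeds the threshold evaluated at the horizon, so it grows logarithmically with time and cubically with $d$, which is the source of the $d^3 \log T$ factor in the regret.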
Proof: Since $|A(T)| \le d\lceil d^2 w \log T\rceil$, the regret caused in the exploration sequence is at the order
of $m d^3 \log T$. Now we consider the regret caused in the exploitation sequence. Let $E_k$ denote the $k$th
exploitation period, which is the $k$th contiguous segment in the exploitation sequence, and let $t_k > 1$
denote the starting time of the $k$th exploitation period. Similar to the proof of Theorem 3.1 in [6], we have

$$|E_k| \le h\, t_k \tag{3}$$

for some constant $h$ independent of $d$ and $m$. Next, we show that by applying the Chernoff-Hoeffding
bound in (2) on the path cost, for any $t$ in the $k$th exploitation period and $i = 1, \ldots, d$, we have

$$\Pr\big(|E[C_t(p_i)] - \bar{\theta}_{p_i}(t)| \ge c/(2d)\big) \le 2\, t_k^{-ab/m}.$$

To show this, we define the parameter $\epsilon_i(t) = \sqrt{b \log t/\tau_i(t)}$, where $\tau_i(t)$ is the number of times that
path $p_i$ has been sampled up to time $t$. Since each basis path has been sampled at least $\lceil d^2 w \log t\rceil$ times
by any exploitation slot $t$, the choice of the parameters $b$ and $w$ gives

$$\epsilon_i(t) \le \min\{m\zeta u_0,\ c/(2d)\}. \tag{4}$$

Applying the Chernoff-Hoeffding bound, we arrive at

$$\Pr\big(|E[C_t(p_i)] - \bar{\theta}_{p_i}(t)| \ge c/(2d)\big) \le \Pr\big(|E[C_t(p_i)] - \bar{\theta}_{p_i}(t)| \ge \epsilon_i(t)\big) \le 2\, t^{-ab/m} \le 2\, t_k^{-ab/m}.$$

In the exploitation sequence, the expected number of times that at least one path in $B$ has a sample mean
deviating from its true mean cost by $c/(2d)$ is thus bounded by

$$\sum_{k=1}^{\infty} 2d\, t_k^{-ab/m}\, h\, t_k \le \sum_{t=1}^{\infty} 2hd\, t^{1-ab/m} = gd \tag{5}$$

for some constant $g$ independent of $d$ and $m$. Based on the property of the barycentric spanner, the best
path would not be selected in the exploitation sequence only if one of the basis vectors in $B$ has a sample
mean deviating from its true mean cost by at least $c/(2d)$: every path estimate combines the $d$ basis
estimates with coefficients in $[-1, 1]$, so otherwise its error stays below $c/2$. We thus proved the theorem. ■
In Theorem 1, we need a lower bound (parameter $c$) on the difference in the cost means of the best and
the second best paths. We also need to know the bounds on parameters $\zeta$ and $u_0$ such that the Chernoff-Hoeffding
bound (2) holds. These bounds are required in defining $w$ that specifies the minimum leading
constant of the logarithmic cardinality of the exploration sequence necessary for identifying the best path.
Similar to [5], we can show that without any knowledge of the cost models, increasing the cardinality
of the exploration sequence of $\pi^*$ by an arbitrarily small amount leads to a regret linear with $d$ and
arbitrarily close to the logarithmic order with time.
Theorem 2: Let $f(t)$ be any positive increasing sequence with $f(t) \to \infty$ as $t \to \infty$. Revise policy
$\pi^*$ in Theorem 1 as follows: include $t$ ($t > 1$) in $A(t)$ if $|A(t-1)| < d\lceil f(t) \log t\rceil$. Under the revised
policy $\pi'$, we have

$$\mathcal{R}_{\pi'}(T) = O(d f(T) \log T).$$

Proof: It is sufficient to show that the regret caused in the exploitation sequence is bounded by $O(d)$,
independent of $T$. Since the exploration sequence is denser than the logarithmic order as in Theorem 1,
it is not difficult to show that the bound on $|E_k|$ given in (3) still holds with a different value of $h$.

We consider any positive increasing sequence $b(t)$ such that $b(t) = o(f(t))$ and $b(t) \to \infty$ as $t \to \infty$.
By replacing $b$ in the proof of Theorem 1 with $b(t)$, we notice that after some finite time $T_0$, the parameter
$\epsilon_i(t)$ will be small enough to ensure (4) holds and $b(t)$ will be large enough to ensure (5) holds. The
proof thus follows. ■

B. Heavy-Tailed Cost Distributions 

We now consider the heavy-tailed cost distributions where the moment of the cost exists up to the $q$-th
(q > 1) order. From O, we have the following bound on the deviation of the sample mean from the true 
mean for heavy-tailed cost distributions. 

Let $\{X(t)\}_{t=1}^{\infty}$ be i.i.d. random variables drawn from a distribution with the $q$-th moment ($q > 1$). Let
$\overline{X}_t = (\sum_{k=1}^{t} X(k))/t$ and $\theta = E[X(1)]$. We have, for all $\epsilon > 0$,

$$\Pr(|\overline{X}_t - \theta| > \epsilon) = o(t^{1-q}). \tag{6}$$

Theorem 3: Construct an exploration sequence as follows. Choose a constant $v > 0$. For each $t > 1$,
if $|A(t-1)| < v t^{1/q}$, then include $t$ in $A(t)$. Under this exploration sequence, the resulting policy $\pi^q$
has regret $\mathcal{R}_{\pi^q}(T) \le D d T^{1/q}$ for some constant $D$ independent of $d$ and $T$.

Proof: Based on the construction of the exploration sequence, it is sufficient to show that the regret
in the exploitation sequence is $o(T^{1/q}) \cdot d$. From (6), we have, for any $i = 1, \ldots, d$,

$$\Pr\big(|E[C_t(p_i)] - \bar{\theta}_{p_i}(t)| \ge c/(2d)\big) = o(|A(t)|^{1-q}).$$

For any exploitation slot $t \notin A(t)$, we have $|A(t)| \ge v t^{1/q}$. We arrive at

$$\Pr\big(|E[C_t(p_i)] - \bar{\theta}_{p_i}(t)| \ge c/(2d)\big) = o(t^{(1-q)/q}).$$

Since the best path will not be chosen only if at least one of the basis vectors has the sample mean deviating
from the true mean by $c/(2d)$, the regret in the exploitation sequence is thus bounded by

$$\sum_{t=1}^{T} o(t^{(1-q)/q}) \cdot d = o(T^{1/q}) \cdot d.$$
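
The order of the final sum can be checked by comparing it with an integral. The display below is a worked sketch under the assumption that each $o(\cdot)$ term is replaced by its envelope $t^{(1-q)/q}$ with $q > 1$; it verifies the order in $T$ only and says nothing about constants.

```latex
\sum_{t=1}^{T} t^{(1-q)/q}
  = \sum_{t=1}^{T} t^{1/q - 1}
  \le 1 + \int_{1}^{T} x^{1/q - 1}\, dx
  = 1 + q\,\big(T^{1/q} - 1\big)
  \le q\, T^{1/q}.
```

The inequality uses that $x^{1/q - 1}$ is decreasing for $q > 1$. Because the envelope sum diverges at rate $T^{1/q}$, summing terms that are $o(t^{(1-q)/q})$ gives $o(T^{1/q})$.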
IV. Generalization to Stochastic Online Linear Optimization Problems

In this section, we consider the general Stochastic Online Linear Optimization (SOLO) problems that
include adaptive routing as a special case. In a SOLO problem, there is a compact set $\mathcal{U} \subset \mathbb{R}^d$ with
dimension $d$, referred to as the action set. At each time $t$, we choose a point $x \in \mathcal{U}$ and observe a cost $C_t \cdot x$,
where $C_t \in \mathbb{R}^d$ is drawn from an unknown distribution². We assume that $\{C_t\}_{t \ge 1}$ are i.i.d. over $t$. The
objective is to design a sequential action selection policy $\pi$ to minimize the regret $\mathcal{R}_\pi(T)$, defined as the
total difference in the expected cost over $T$ slots compared to the optimal action $x^* = \arg\min\{E[C_t \cdot x]\}$:

$$\mathcal{R}_\pi(T) \triangleq E\Big[\sum_{t=1}^{T} \big(C_t \cdot x(\pi) - C_t \cdot x^*\big)\Big],$$

where $x(\pi)$ denotes the action under policy $\pi$ at time $t$.

As a special case, the action space of the adaptive routing problem consists of all path vectors with
dimension $d \le m$. The main difficulty in a general SOLO problem is in identifying the optimal action in
an infinite action set. Our basic approach is to implement the shortest-path algorithm in Fig. 1 repeatedly
for the chosen action to gradually converge to the optimal action. Specifically, the proposed algorithm 
runs under an epoch structure where each epoch simulates one round of the algorithm and the epoch 
lengths form a geometric progression. 
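
The epoch structure can be sketched as follows. The first epoch length, the growth ratio of 2, and the function name `epoch_lengths` are assumptions made for illustration; the text only requires that the lengths form a geometric progression. The last epoch is truncated so that the lengths sum to the horizon.

```python
def epoch_lengths(horizon, first=1, ratio=2):
    """Split `horizon` slots into epochs whose lengths grow geometrically."""
    lengths, length, used = [], first, 0
    while used < horizon:
        step = min(length, horizon - used)   # truncate the final epoch
        lengths.append(step)
        used += step
        length *= ratio
    return lengths


if __name__ == "__main__":
    print(epoch_lengths(100))
```

With a ratio of 2 the number of epochs grows only logarithmically with the horizon, so restarting the algorithm at each epoch boundary costs relatively little.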

Theorem 4: Assume that the cost for each action is light-tailed. Let $T_k$ be the length of the $k$-th
epoch. Construct an exploration sequence in this epoch as follows. Let $a, \zeta, u_0$ be the constants such
that (2) holds on each cost. Choose a constant $b > 2/a$, a constant $c = \log^{1/3} T_k / T_k^{1/3}$, and a constant
$w > \max\{b/(d\zeta u_0)^2,\ 4b/c^2\}$. For each $t > 1$, if $|A(t-1)| < d\lceil d^2 w \log t\rceil$, then include $t$ in $A(t)$. The
resulting policy $\pi^g$ has regret

$$\mathcal{R}_{\pi^g}(T) = O(d^3 T^{2/3} \log^{1/3} T).$$

We point out that the regret order can be improved to linear with $d$ by sacrificing the order with $T$ by
an arbitrarily small amount, similar to Theorem 2.
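
The order in Theorem 4 can be sanity-checked by counting exploration slots only. The arithmetic below is a sketch under stated assumptions: $T_k$ is used as a label for the length of the $k$-th epoch, the gap parameter is read as $c = (\log T_k / T_k)^{1/3}$, and the term $4b/c^2$ is taken to be the dominant one in the choice of $w$. It does not cover the exploitation slots and is not a proof of the theorem.

```latex
\begin{align*}
\frac{4b}{c^2} &= 4b \Big(\frac{T_k}{\log T_k}\Big)^{2/3}, \\
d \cdot d^2 \cdot \frac{4b}{c^2} \cdot \log T_k
  &= 4b\, d^3\, T_k^{2/3} \log^{1/3} T_k, \\
\sum_{k} T_k^{2/3} \log^{1/3} T_k
  &\le \log^{1/3} T \sum_{k} T_k^{2/3}
  = O\big(T^{2/3} \log^{1/3} T\big).
\end{align*}
```

The last step uses that the epoch lengths form a geometric progression, so the sum of $T_k^{2/3}$ is dominated by its last term up to a constant factor.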

V. Conclusion 

In this paper, we considered the adaptive routing problem in networks with unknown and stochastically
varying link states, where only the total end-to-end cost of a path is observable after the path is selected
for routing. For both light-tailed and heavy-tailed link-state distributions, we proposed efficient online
learning algorithms to minimize the regret in terms of both time and network size. The result was further
extended to the general stochastic online linear optimization problems. Future work includes extending
the i.i.d. cost evolution over time to more general stochastic processes.

² The result in Section IV holds in a more general scenario that only requires the expected cost to be linear in $x$.